Four dataframes are imported:
- `data_ori`: the original data, with every row labeled.
- `data_ss`: the semi-supervised data, which contains both labeled and unlabeled rows.
- `data_label`: the true labels for the rows that are unlabeled in `data_ss`.
- `data_test`: the test data.

The data was simulated with the `make_blobs` function from Python's `sklearn.datasets` module, which generates isotropic Gaussian blobs for clustering. The resulting datasets have the following sizes:

| Data | Rows | Columns | Labeled | Unlabeled |
|---|---|---|---|---|
| Original Data | 17000 | 5 | 17000 | 0 |
| Semi-Supervised Data | 13600 | 5 | 136 | 13464 |
| Labeled Data | 13464 | 5 | 13464 | 0 |
| Test Data | 3400 | 5 | 3400 | 0 |
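
The sizes in the table are consistent with an 80/20 train/test split of the original data, followed by keeping the labels of 1% of the training rows (13600 × 0.01 = 136). These proportions are inferred from the table, not stated by the source. A minimal sketch in base R reproduces the counts; the five-class label vector is a hypothetical stand-in for the real labels, and the seed is arbitrary:

```r
# Sizes taken from the table above.
n_total   <- 17000
n_test    <- 3400
n_labeled <- 136

set.seed(42)  # arbitrary seed, only to make this sketch reproducible

# Hypothetical stand-in for the real class labels (five classes is an assumption).
target <- sample(0:4, n_total, replace = TRUE)

# Hold out the test rows; the remaining rows form the semi-supervised set.
test_idx  <- sample(n_total, n_test)
train_idx <- setdiff(seq_len(n_total), test_idx)

# Keep the labels of a small subset and mask all other training labels with -1.
keep_idx  <- sample(train_idx, n_labeled)
target_ss <- target[train_idx]
target_ss[!(train_idx %in% keep_idx)] <- -1

counts <- c(
  train     = length(train_idx),
  labeled   = sum(target_ss != -1),
  unlabeled = sum(target_ss == -1),
  test      = length(test_idx)
)
print(counts)
# train = 13600, labeled = 136, unlabeled = 13464, test = 3400
```

The `-1` marker matches the value that the plotting code later filters on in `data_ss`.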
The boxplots below show the distribution of each feature by data source, with one panel per class and the unlabeled rows excluded. They help us compare the feature distributions across sources and identify outliers.
library(dplyr)    # provides %>% and filter()
library(ggplot2)  # provides ggplot() and the geoms

# Feature 0: boxplot per source, one panel per class (unlabeled rows excluded)
df %>%
  filter(source != "Unlabeled Data") %>%
  ggplot(aes(x = source, y = X0)) +
  geom_boxplot() +
  geom_jitter(aes(color = source), alpha = 0.5, width = 0.1) +
  labs(title = "Distribution of Feature 0", x = "Source", y = "Feature 0") +
  theme_bw() +
  facet_wrap(~target, scales = "free") +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 14, hjust = 0.5),
        legend.title = element_blank(),
        axis.text.x = element_blank())

# Feature 1
df %>%
  filter(source != "Unlabeled Data") %>%
  ggplot(aes(x = source, y = X1)) +
  geom_boxplot() +
  geom_jitter(aes(color = source), alpha = 0.5, width = 0.1) +
  labs(title = "Distribution of Feature 1", x = "Source", y = "Feature 1") +
  theme_bw() +
  facet_wrap(~target, scales = "free") +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 14, hjust = 0.5),
        legend.title = element_blank(),
        axis.text.x = element_blank())

# Feature 2
df %>%
  filter(source != "Unlabeled Data") %>%
  ggplot(aes(x = source, y = X2)) +
  geom_boxplot() +
  geom_jitter(aes(color = source), alpha = 0.5, width = 0.1) +
  labs(title = "Distribution of Feature 2", x = "Source", y = "Feature 2") +
  theme_bw() +
  facet_wrap(~target, scales = "free") +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 14, hjust = 0.5),
        legend.title = element_blank(),
        axis.text.x = element_blank())

# Feature 3
df %>%
  filter(source != "Unlabeled Data") %>%
  ggplot(aes(x = source, y = X3)) +
  geom_boxplot() +
  geom_jitter(aes(color = source), alpha = 0.5, width = 0.1) +
  labs(title = "Distribution of Feature 3", x = "Source", y = "Feature 3") +
  theme_bw() +
  facet_wrap(~target, scales = "free") +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 14, hjust = 0.5),
        legend.title = element_blank(),
        axis.text.x = element_blank())

# Feature 4
df %>%
  filter(source != "Unlabeled Data") %>%
  ggplot(aes(x = source, y = X4)) +
  geom_boxplot() +
  geom_jitter(aes(color = source), alpha = 0.5, width = 0.1) +
  labs(title = "Distribution of Feature 4", x = "Source", y = "Feature 4") +
  theme_bw() +
  facet_wrap(~target, scales = "free") +
  theme(legend.position = "bottom",
        plot.title = element_text(size = 14, hjust = 0.5),
        legend.title = element_blank(),
        axis.text.x = element_blank())

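
The five boxplot calls differ only in the feature column and its label, so they could be produced by one function. The sketch below is one possible refactoring, not the author's code: `plot_feature` and `toy_df` are hypothetical names, the toy data is made up so the block runs on its own, and the `.data[[...]]` pronoun assumes a reasonably recent ggplot2 (3.0.0 or later).

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: draws the boxplot for one feature column.
# `feature` is the column name as a string, e.g. "X0".
plot_feature <- function(data, feature) {
  # Turn "X0" into the axis label "Feature 0".
  label <- paste("Feature", sub("^X", "", feature))

  data %>%
    filter(source != "Unlabeled Data") %>%
    ggplot(aes(x = source, y = .data[[feature]])) +
    geom_boxplot() +
    geom_jitter(aes(color = source), alpha = 0.5, width = 0.1) +
    labs(title = paste("Distribution of", label), x = "Source", y = label) +
    theme_bw() +
    facet_wrap(~target, scales = "free") +
    theme(legend.position = "bottom",
          plot.title = element_text(size = 14, hjust = 0.5),
          legend.title = element_blank(),
          axis.text.x = element_blank())
}

# Toy stand-in for `df` (values and source names are made up).
set.seed(1)
toy_df <- data.frame(
  source = rep(c("Original Data", "Test Data", "Unlabeled Data"), each = 20),
  target = factor(rep(0:1, times = 30)),
  X0 = rnorm(60),
  X1 = rnorm(60)
)

# One plot object per feature column.
plots <- lapply(c("X0", "X1"), function(f) plot_feature(toy_df, f))
```

With the real data, `lapply(paste0("X", 0:4), function(f) plot_feature(df, f))` would yield the five plots in one call.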
The label distribution shows the number of samples per class in the original, semi-supervised, and test data. It indicates how balanced the classes are in each dataset.
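
Such a label distribution can be tabulated before plotting it. The sketch below counts rows per source and class and adds the within-source share; `toy_df`, its rows, and the source names are hypothetical stand-ins for the combined data frame, which is assumed to have `source` and `target` columns as in the boxplot code.

```r
library(dplyr)

# Toy stand-in for the combined data frame (made-up rows, assumed source names).
toy_df <- data.frame(
  source = c(rep("Original Data", 6), rep("Test Data", 4)),
  target = factor(c(0, 0, 1, 1, 2, 2, 0, 1, 1, 2))
)

label_counts <- toy_df %>%
  count(source, target) %>%          # rows per (source, class) pair
  group_by(source) %>%
  mutate(share = n / sum(n)) %>%     # class share within each source
  ungroup()

print(label_counts)
```

For the semi-supervised data, the unlabeled rows would appear as their own `-1` class unless they are filtered out first.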
In this section, we visualize the data with pairs plots, which show the pairwise relationships between the features. The original data is compared with the semi-supervised data, which includes both labeled and unlabeled rows.
library(GGally)  # provides ggpairs()

# Semi Supervised data
#my_colors <- c("blue", "green", "purple", "orange","brown")

# Labeled data (rows whose target is not the -1 placeholder)
data_ss %>%
  filter(target != '-1') %>%
  ggpairs(columns = 1:5,
          mapping = aes(color = target, alpha = 0.5))
#scale_fill_manual(values = my_colors) +
#scale_color_manual(values = my_colors)

# Unlabeled data (rows whose target is -1)
data_ss %>%
  filter(target == '-1') %>%
  ggpairs(columns = 1:5, mapping = aes(alpha = 0.5)) +
  scale_fill_manual(values = 'black') +
  scale_color_manual(values = 'black')

# Test data
data_test %>%
  ggpairs(columns = 1:5,
          mapping = aes(color = target, alpha = 0.5))